Sequential Feature Selection and Inference using Multivariate Random Forests.
Authors
Abstract
Motivation: Random forest has become a widely popular prediction-generating mechanism. Its strength lies in its flexibility, interpretability and ability to handle a large number of features, typically larger than the sample size. However, this methodology is of limited use if one wishes to identify statistically significant features. Several ranking schemes are available that provide information on the relative importance of the features, but there is a paucity of general inferential mechanisms, particularly in a multivariate setup. We use the conditional inference tree framework to generate a random forest where features are deleted sequentially based on explicit hypothesis testing. The resulting sequential algorithm offers an inferentially justifiable, but model-free, variable selection procedure. Significant features are then used to generate a predictive random forest. An added advantage of our methodology is that both variable selection and prediction are based on the conditional inference framework and hence are coherent.
Results: We illustrate the performance of our Sequential Multi Response Feature Selection (SMuRFS) approach through simulation studies and finally apply this methodology to the Genomics of Drug Sensitivity for Cancer (GDSC) dataset to identify genetic characteristics that significantly impact drug sensitivities. The significant set of predictors obtained from our method is further validated from a biological perspective.
Availability: https://github.com/jomayer/SMuRF.
Contact: [email protected].
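The idea of deleting features sequentially on the basis of explicit hypothesis tests can be illustrated with a minimal sketch. This is not the SMuRFS implementation and does not build conditional inference trees: as a stated simplification, it swaps in a plain permutation test of independence between one feature and a multivariate response, applied to a random subset of the surviving features in each round. All names here (`association_stat`, `permutation_pvalue`, `sequential_selection`, `mtry`, `n_rounds`) are hypothetical and chosen only for illustration.

```python
import random


def association_stat(x, Y):
    """Sum over response columns of the squared sample covariance with x."""
    n = len(x)
    x_mean = sum(x) / n
    total = 0.0
    for j in range(len(Y[0])):
        y_mean = sum(row[j] for row in Y) / n
        cov = sum((x[i] - x_mean) * (Y[i][j] - y_mean) for i in range(n)) / n
        total += cov * cov
    return total


def permutation_pvalue(x, Y, n_perm, rng):
    """Permutation p-value for the null 'x is independent of every column of Y'."""
    observed = association_stat(x, Y)
    shuffled = list(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between x and the rows of Y
        if association_stat(shuffled, Y) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def sequential_selection(X, Y, alpha=0.05, mtry=3, n_rounds=20, n_perm=199, seed=0):
    """Assumed simplification: in each round, test a random subset of the
    surviving features and delete those whose p-value exceeds alpha.

    X maps feature names to lists of values; Y is a list of response rows.
    """
    rng = random.Random(seed)
    active = sorted(X)
    for _ in range(n_rounds):
        if not active:
            break
        candidates = rng.sample(active, min(mtry, len(active)))
        for name in candidates:
            if permutation_pvalue(X[name], Y, n_perm, rng) > alpha:
                active.remove(name)  # deleted features are never tested again
    return active


# Toy data: one informative feature, two pure-noise features, two responses.
demo_rng = random.Random(42)
signal = [demo_rng.gauss(0, 1) for _ in range(40)]
features = {
    "signal": signal,
    "noise_a": [demo_rng.gauss(0, 1) for _ in range(40)],
    "noise_b": [demo_rng.gauss(0, 1) for _ in range(40)],
}
responses = [[2 * s + demo_rng.gauss(0, 0.1), -s + demo_rng.gauss(0, 0.1)] for s in signal]
selected = sequential_selection(features, responses, n_rounds=10)
```

The features left in `selected` would then be the inputs to a predictive forest. In the approach the abstract describes, both the tests and the final forest come from the conditional inference framework, which this sketch does not reproduce.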
Similar resources
A Random Forest Classifier based on Genetic Algorithm for Cardiovascular Diseases Diagnosis (RESEARCH NOTE)
Machine learning-based classification techniques provide support for the decision-making process in the field of healthcare, especially in disease diagnosis, prognosis and screening. Healthcare datasets are voluminous in nature, and their high dimensionality leads to a slower learning rate and higher computational cost. Feature selection is expected to deal with the high dimen...
Random Forests with Missing Values in the Covariates
In Random Forests [2] several trees are constructed from bootstrap or subsamples of the original data. Random Forests have become very popular, e.g., in the fields of genetics and bioinformatics, because they can deal with high-dimensional problems including complex interaction effects. Conditional Inference Forests [8] provide an implementation of Random Forests with unbiased variable selection...
Airborne Lidar Feature Selection for Urban Classification Using Random Forests
Various multi-echo and Full-waveform (FW) lidar features can be processed. In this paper, multiple classifiers are applied to lidar feature selection for urban scene classification. Random forests are used since they provide an accurate classification and run efficiently on large datasets. Moreover, they return measures of variable importance for each class. The feature selection is obtained by ...
Sequential and Mixed Genetic Algorithm and Learning Automata (SGALA, MGALA) for Feature Selection in QSAR
Feature selection is of great importance in Quantitative Structure-Activity Relationship (QSAR) analysis. This problem has been solved using some meta-heuristic algorithms such as: GA, PSO, ACO, SA and so on. In this work two novel hybrid meta-heuristic algorithms i.e. Sequential GA and LA (SGALA) and Mixed GA and LA (MGALA), which are based on Genetic algorithm and learning automata for QSAR f...
Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations
The use of random forests is increasingly common in genetic association studies. The variable importance measure (VIM) that is automatically calculated as a by-product of the algorithm is often used to rank polymorphisms with respect to their ability to predict the investigated phenotype. Here, we investigate a characteristic of this methodology that may be considered as an important pitfall, n...
Journal title:
- Bioinformatics
Volume, issue
Pages -
Publication date 2017